Kwanza Tukule Data Analysis¶
This report showcases the completion of the Kwanza Tukule Data Analyst Assessment, designed to evaluate my technical, analytical, and problem-solving skills. Using the provided anonymized sales dataset, I applied data cleaning, preparation, and exploratory analysis techniques to uncover actionable insights. The analysis includes identifying trends, customer segmentation, forecasting, and anomaly detection to address key business challenges. Strategic recommendations are derived from the findings, focusing on product performance, customer retention, and operational efficiency. Additionally, I developed a dashboard to visualize key metrics, enabling better decision-making. This report demonstrates my ability to analyze data effectively and present meaningful insights in a structured, professional manner.
Importing Libraries¶
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from matplotlib import pyplot as plt
%matplotlib inline
#html export
import plotly.io as pio
pio.renderers.default = 'notebook'
Loading Dataset¶
df = pd.read_excel(r'Case Study Data - Read Only.xlsx')
df
| | DATE | ANONYMIZED CATEGORY | ANONYMIZED PRODUCT | ANONYMIZED BUSINESS | ANONYMIZED LOCATION | QUANTITY | UNIT PRICE |
|---|---|---|---|---|---|---|---|
| 0 | 2024-08-18 21:32:00 | Category-106 | Product-21f4 | Business-de42 | Location-1ba8 | 1 | 850.0 |
| 1 | 2024-08-18 21:32:00 | Category-120 | Product-4156 | Business-de42 | Location-1ba8 | 2 | 1910.0 |
| 2 | 2024-08-18 21:32:00 | Category-121 | Product-49bd | Business-de42 | Location-1ba8 | 1 | 3670.0 |
| 3 | 2024-08-18 21:32:00 | Category-76 | Product-61dd | Business-de42 | Location-1ba8 | 1 | 2605.0 |
| 4 | 2024-08-18 21:32:00 | Category-119 | Product-66e0 | Business-de42 | Location-1ba8 | 5 | 1480.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 333400 | 2024-01-09 20:49:00 | Category-119 | Product-e98d | Business-f9ff | Location-1979 | 1 | 1770.0 |
| 333401 | 2024-08-20 20:24:00 | Category-106 | Product-21f4 | Business-72bc | Location-689f | 1 | 850.0 |
| 333402 | 2024-08-20 20:24:00 | Category-76 | Product-6e9c | Business-72bc | Location-689f | 1 | 2575.0 |
| 333403 | 2024-10-10 18:19:00 | Category-91 | Product-523e | Business-2693 | Location-b27b | 1 | 1520.0 |
| 333404 | 2024-10-10 18:19:00 | Category-75 | Product-b31e | Business-2693 | Location-b27b | 1 | 4420.0 |
333405 rows × 7 columns
Attributes¶
DATE: Represents the date on which the transaction or activity occurred.
ANONYMIZED CATEGORY: Categorization of the product or service involved in the transaction, with the actual category names replaced by anonymized labels.
ANONYMIZED PRODUCT: Refers to the specific product associated with the transaction, with product names anonymized for confidentiality.
ANONYMIZED BUSINESS: Represents the business entity involved in the transaction.
ANONYMIZED LOCATION: The location where the transaction or activity occurred.
QUANTITY: The quantity of the product involved in the transaction, recorded as an integer.
UNIT PRICE: The price per unit of the product.
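Given these attributes, the expected schema can be sanity-checked up front before any cleaning. A minimal sketch; `validate_schema` is a hypothetical helper (not part of the assessment), and the raw column names are assumed to match the preview above:

```python
import pandas as pd

# Expected raw columns (assumed from the dataset preview above)
EXPECTED_COLUMNS = [
    "DATE", "ANONYMIZED CATEGORY", "ANONYMIZED PRODUCT",
    "ANONYMIZED BUSINESS", "ANONYMIZED LOCATION", "QUANTITY", "UNIT PRICE",
]

def validate_schema(df: pd.DataFrame) -> list:
    """Return a list of human-readable schema problems (empty list = OK)."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "QUANTITY" in df.columns and not pd.api.types.is_integer_dtype(df["QUANTITY"]):
        problems.append("QUANTITY is not integer-typed")
    if "UNIT PRICE" in df.columns and not pd.api.types.is_numeric_dtype(df["UNIT PRICE"]):
        problems.append("UNIT PRICE is not numeric")
    return problems
```

Running this right after `pd.read_excel` would surface a renamed or mistyped column before it silently breaks a later groupby.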
Data Cleaning and Preparation¶
# make column names and values uniform
df.columns = df.columns.str.lower().str.replace(' ', '_')
categorical_columns = df.dtypes[df.dtypes == 'object'].index
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')
df
| | date | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price |
|---|---|---|---|---|---|---|---|
| 0 | 2024-08-18 21:32:00 | category-106 | product-21f4 | business-de42 | location-1ba8 | 1 | 850.0 |
| 1 | 2024-08-18 21:32:00 | category-120 | product-4156 | business-de42 | location-1ba8 | 2 | 1910.0 |
| 2 | 2024-08-18 21:32:00 | category-121 | product-49bd | business-de42 | location-1ba8 | 1 | 3670.0 |
| 3 | 2024-08-18 21:32:00 | category-76 | product-61dd | business-de42 | location-1ba8 | 1 | 2605.0 |
| 4 | 2024-08-18 21:32:00 | category-119 | product-66e0 | business-de42 | location-1ba8 | 5 | 1480.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 333400 | 2024-01-09 20:49:00 | category-119 | product-e98d | business-f9ff | location-1979 | 1 | 1770.0 |
| 333401 | 2024-08-20 20:24:00 | category-106 | product-21f4 | business-72bc | location-689f | 1 | 850.0 |
| 333402 | 2024-08-20 20:24:00 | category-76 | product-6e9c | business-72bc | location-689f | 1 | 2575.0 |
| 333403 | 2024-10-10 18:19:00 | category-91 | product-523e | business-2693 | location-b27b | 1 | 1520.0 |
| 333404 | 2024-10-10 18:19:00 | category-75 | product-b31e | business-2693 | location-b27b | 1 | 4420.0 |
333405 rows × 7 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333405 entries, 0 to 333404
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   date                 333405 non-null  datetime64[ns]
 1   anonymized_category  333405 non-null  object
 2   anonymized_product   333405 non-null  object
 3   anonymized_business  333405 non-null  object
 4   anonymized_location  333405 non-null  object
 5   quantity             333405 non-null  int64
 6   unit_price           333397 non-null  float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 17.8+ MB
# duplicates
df.duplicated().sum()
3524
There are 3,524 duplicate rows, so we drop them.
#dropping duplicates
df = df.drop_duplicates()
df.isna().sum()
date                   0
anonymized_category    0
anonymized_product     0
anonymized_business    0
anonymized_location    0
quantity               0
unit_price             8
dtype: int64
The column 'unit_price' has 8 null values.
df[df.unit_price.isna()]
| | date | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price |
|---|---|---|---|---|---|---|---|
| 108112 | 2024-07-03 17:53:00 | category-94 | product-3d7f | business-4fce | location-f37d | 2 | NaN |
| 150961 | 2024-12-16 18:33:00 | category-79 | product-dfc8 | business-8bbf | location-3fc0 | 1 | NaN |
| 151142 | 2024-12-22 14:42:00 | category-122 | product-15e0 | business-c575 | location-1979 | 3 | NaN |
| 272379 | 2024-06-27 12:15:00 | category-92 | product-ccbc | business-14b6 | location-1979 | 1 | NaN |
| 278284 | 2024-08-14 21:09:00 | category-101 | product-84a5 | business-4be1 | location-bb69 | 21 | NaN |
| 278384 | 2024-12-30 14:17:00 | category-95 | product-15f3 | business-1a74 | location-f37d | 1 | NaN |
| 310385 | 2024-03-31 14:03:00 | category-114 | product-9204 | business-c9dc | location-689f | 1 | NaN |
| 327152 | 2024-08-13 16:20:00 | category-107 | product-7eed | business-0d61 | location-1ba8 | 1 | NaN |
Let's check whether the anonymized product determines the unit price, so that we can fill the nulls.
df_product_3d7fdf = df[df['anonymized_product'] == 'product-3d7f']
df_product_3d7fdf
| | date | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price |
|---|---|---|---|---|---|---|---|
| 196 | 2024-09-25 13:05:00 | category-94 | product-3d7f | business-cdc1 | location-e2f8 | 1 | 860.0 |
| 287 | 2024-11-13 19:36:00 | category-94 | product-3d7f | business-7f77 | location-8959 | 1 | 860.0 |
| 319 | 2024-11-01 16:54:00 | category-94 | product-3d7f | business-2b91 | location-3fc0 | 1 | 860.0 |
| 372 | 2024-07-11 11:20:00 | category-94 | product-3d7f | business-2b24 | location-66f4 | 5 | 875.0 |
| 447 | 2024-12-09 19:35:00 | category-94 | product-3d7f | business-1b52 | location-7f37 | 1 | 885.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 332771 | 2024-10-15 20:01:00 | category-94 | product-3d7f | business-0ea3 | location-4ea1 | 1 | 860.0 |
| 332802 | 2024-10-21 19:04:00 | category-94 | product-3d7f | business-5415 | location-128a | 1 | 860.0 |
| 332842 | 2024-09-20 13:43:00 | category-94 | product-3d7f | business-5760 | location-689f | 1 | 860.0 |
| 333002 | 2024-10-08 17:15:00 | category-94 | product-3d7f | business-8603 | location-b27b | 1 | 860.0 |
| 333188 | 2024-11-08 21:20:00 | category-94 | product-3d7f | business-8bbf | location-3fc0 | 1 | 860.0 |
3164 rows × 7 columns
df_product_3d7fdf['unit_price'].unique()
array([860., 875., 885., 870., nan])
For product 'product-3d7f' there are four distinct unit prices: 860.0, 875.0, 885.0 and 870.0 (plus the NaN we are trying to fill).
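Rather than inspecting one product at a time, the same hypothesis can be tested across all products at once by counting distinct non-null prices per product. A sketch; `products_with_single_price` is a hypothetical helper:

```python
import pandas as pd

def products_with_single_price(df: pd.DataFrame) -> pd.Series:
    """Count distinct non-null unit prices per product and return only the
    products with exactly one observed price. For those products a missing
    unit_price could be filled unambiguously; for the rest it cannot."""
    n_prices = df.groupby('anonymized_product')['unit_price'].nunique()
    return n_prices[n_prices == 1]
```

If the products with nulls do not appear in this result (as is the case for 'product-3d7f', which has four prices), there is no single obvious fill value.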
df_product_3d7fdf[df_product_3d7fdf['unit_price'] == 860.0]
| | date | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price |
|---|---|---|---|---|---|---|---|
| 196 | 2024-09-25 13:05:00 | category-94 | product-3d7f | business-cdc1 | location-e2f8 | 1 | 860.0 |
| 287 | 2024-11-13 19:36:00 | category-94 | product-3d7f | business-7f77 | location-8959 | 1 | 860.0 |
| 319 | 2024-11-01 16:54:00 | category-94 | product-3d7f | business-2b91 | location-3fc0 | 1 | 860.0 |
| 565 | 2024-09-15 15:52:00 | category-94 | product-3d7f | business-f749 | location-3fc0 | 2 | 860.0 |
| 780 | 2024-10-03 15:08:00 | category-94 | product-3d7f | business-cdc1 | location-e2f8 | 1 | 860.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 332771 | 2024-10-15 20:01:00 | category-94 | product-3d7f | business-0ea3 | location-4ea1 | 1 | 860.0 |
| 332802 | 2024-10-21 19:04:00 | category-94 | product-3d7f | business-5415 | location-128a | 1 | 860.0 |
| 332842 | 2024-09-20 13:43:00 | category-94 | product-3d7f | business-5760 | location-689f | 1 | 860.0 |
| 333002 | 2024-10-08 17:15:00 | category-94 | product-3d7f | business-8603 | location-b27b | 1 | 860.0 |
| 333188 | 2024-11-08 21:20:00 | category-94 | product-3d7f | business-8bbf | location-3fc0 | 1 | 860.0 |
2344 rows × 7 columns
Let's check whether the anonymized category pins down the price 860.0.
df_product_3d7fdf[df_product_3d7fdf['anonymized_category'] == 'category-94']
| | date | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price |
|---|---|---|---|---|---|---|---|
| 196 | 2024-09-25 13:05:00 | category-94 | product-3d7f | business-cdc1 | location-e2f8 | 1 | 860.0 |
| 287 | 2024-11-13 19:36:00 | category-94 | product-3d7f | business-7f77 | location-8959 | 1 | 860.0 |
| 319 | 2024-11-01 16:54:00 | category-94 | product-3d7f | business-2b91 | location-3fc0 | 1 | 860.0 |
| 372 | 2024-07-11 11:20:00 | category-94 | product-3d7f | business-2b24 | location-66f4 | 5 | 875.0 |
| 447 | 2024-12-09 19:35:00 | category-94 | product-3d7f | business-1b52 | location-7f37 | 1 | 885.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 332771 | 2024-10-15 20:01:00 | category-94 | product-3d7f | business-0ea3 | location-4ea1 | 1 | 860.0 |
| 332802 | 2024-10-21 19:04:00 | category-94 | product-3d7f | business-5415 | location-128a | 1 | 860.0 |
| 332842 | 2024-09-20 13:43:00 | category-94 | product-3d7f | business-5760 | location-689f | 1 | 860.0 |
| 333002 | 2024-10-08 17:15:00 | category-94 | product-3d7f | business-8603 | location-b27b | 1 | 860.0 |
| 333188 | 2024-11-08 21:20:00 | category-94 | product-3d7f | business-8bbf | location-3fc0 | 1 | 860.0 |
3164 rows × 7 columns
Even restricting product 'product-3d7f' to category 'category-94', the price still varies, so there is no supporting attribute to impute the unit price from.
Hence, it is safe to drop the nulls.
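With only 8 affected rows, dropping is the pragmatic choice here. Had the gaps been larger, filling each missing price with the product's most common observed price would be a reasonable alternative. A sketch only, under that assumption; this is not what the analysis below does:

```python
import pandas as pd

def impute_modal_price(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing unit_price with the modal (most frequent) price observed
    for the same product. Rows whose product has no observed price stay NaN."""
    out = df.copy()
    modal = (
        out.dropna(subset=['unit_price'])
           .groupby('anonymized_product')['unit_price']
           .agg(lambda s: s.mode().iloc[0])
    )
    out['unit_price'] = out['unit_price'].fillna(
        out['anonymized_product'].map(modal)
    )
    return out
```

The trade-off: the modal price smooths over genuine price changes over time, which is another reason dropping 8 rows is safer here.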
# dropping nulls
df = df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 329873 entries, 0 to 333404
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   date                 329873 non-null  datetime64[ns]
 1   anonymized_category  329873 non-null  object
 2   anonymized_product   329873 non-null  object
 3   anonymized_business  329873 non-null  object
 4   anonymized_location  329873 non-null  object
 5   quantity             329873 non-null  int64
 6   unit_price           329873 non-null  float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 20.1+ MB
Summary of Issues Identified and Steps Taken to Resolve Them:¶
- Duplicate Records:
- Issue Identified: The dataset initially contained 3524 duplicate records, which could skew the analysis and affect the accuracy of any insights derived from the data.
- Steps Taken: The duplicates were identified and dropped from the dataset using the .duplicated().sum() method and .drop_duplicates() to ensure each record is unique and accurately represents individual sales transactions.
- Null Values in 'unit_price' Column:
- Issue Identified: There were 8 null values in the unit_price column, which is critical as the price of the product must be available for proper calculations of sales value and other analyses.
- Steps Taken: The null values in the unit_price column were identified using .isnull().sum() and subsequently dropped from the dataset using .dropna() to remove any incomplete records that might lead to errors in subsequent analysis.
Final Dataset: After addressing these issues, the dataset now contains 329,873 non-null records, with no duplicate entries and all null values removed from the unit_price column. The dataset's columns now have consistent data types and no missing or redundant data, making it ready for further analysis.
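The cleaning steps above can be condensed into one reusable function, which makes the pipeline easy to re-run if the source file is refreshed. A sketch, assuming the same raw column names; `clean` is a hypothetical helper:

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Normalise column names and string values, drop duplicate rows,
    and drop rows with a missing unit_price."""
    out = df.copy()
    out.columns = out.columns.str.lower().str.replace(' ', '_')
    for c in out.select_dtypes(include='object').columns:
        out[c] = out[c].str.lower().str.replace(' ', '_')
    return out.drop_duplicates().dropna(subset=['unit_price'])
```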
Feature Engineering¶
Creating the column: “Month-Year” (e.g., August 2024) from the “DATE” column.
df
| | date | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price |
|---|---|---|---|---|---|---|---|
| 0 | 2024-08-18 21:32:00 | category-106 | product-21f4 | business-de42 | location-1ba8 | 1 | 850.0 |
| 1 | 2024-08-18 21:32:00 | category-120 | product-4156 | business-de42 | location-1ba8 | 2 | 1910.0 |
| 2 | 2024-08-18 21:32:00 | category-121 | product-49bd | business-de42 | location-1ba8 | 1 | 3670.0 |
| 3 | 2024-08-18 21:32:00 | category-76 | product-61dd | business-de42 | location-1ba8 | 1 | 2605.0 |
| 4 | 2024-08-18 21:32:00 | category-119 | product-66e0 | business-de42 | location-1ba8 | 5 | 1480.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 333398 | 2024-11-13 13:59:00 | category-121 | product-898d | business-0e99 | location-689f | 5 | 1360.0 |
| 333401 | 2024-08-20 20:24:00 | category-106 | product-21f4 | business-72bc | location-689f | 1 | 850.0 |
| 333402 | 2024-08-20 20:24:00 | category-76 | product-6e9c | business-72bc | location-689f | 1 | 2575.0 |
| 333403 | 2024-10-10 18:19:00 | category-91 | product-523e | business-2693 | location-b27b | 1 | 1520.0 |
| 333404 | 2024-10-10 18:19:00 | category-75 | product-b31e | business-2693 | location-b27b | 1 | 4420.0 |
329873 rows × 7 columns
# column 'date' is already a datetime object
# ------------Create the 'Month-Year' column---------------#
df2 = df.copy()  # keep a copy of the original cleaned df
# Create a 'Month-Year' column, floored to the first of the month
df2['month-year'] = df2['date'].dt.to_period('M').dt.to_timestamp()
# Format it for display (strftime returns strings, so the datetime itself is not kept)
df2['month-year'] = df2['month-year'].dt.strftime('%B %Y')
# Drop the 'date' column
df2.drop(columns=['date'], inplace=True)
# Move 'Month-Year' column to the first position
columns = ['month-year'] + [col for col in df2.columns if col != 'month-year']
df2 = df2[columns]
# Display the updated DataFrame to check the new column
df2
| | month-year | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price |
|---|---|---|---|---|---|---|---|
| 0 | August 2024 | category-106 | product-21f4 | business-de42 | location-1ba8 | 1 | 850.0 |
| 1 | August 2024 | category-120 | product-4156 | business-de42 | location-1ba8 | 2 | 1910.0 |
| 2 | August 2024 | category-121 | product-49bd | business-de42 | location-1ba8 | 1 | 3670.0 |
| 3 | August 2024 | category-76 | product-61dd | business-de42 | location-1ba8 | 1 | 2605.0 |
| 4 | August 2024 | category-119 | product-66e0 | business-de42 | location-1ba8 | 5 | 1480.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 333398 | November 2024 | category-121 | product-898d | business-0e99 | location-689f | 5 | 1360.0 |
| 333401 | August 2024 | category-106 | product-21f4 | business-72bc | location-689f | 1 | 850.0 |
| 333402 | August 2024 | category-76 | product-6e9c | business-72bc | location-689f | 1 | 2575.0 |
| 333403 | October 2024 | category-91 | product-523e | business-2693 | location-b27b | 1 | 1520.0 |
| 333404 | October 2024 | category-75 | product-b31e | business-2693 | location-b27b | 1 | 4420.0 |
329873 rows × 7 columns
EDA (Exploratory Data Analysis)¶
Sales Overview¶
# Calculate the 'Value' column (quantity × unit_price)
df3 = df2.copy()
df3['Value'] = df3['quantity'] * df3['unit_price']
df3
| | month-year | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price | Value |
|---|---|---|---|---|---|---|---|---|
| 0 | August 2024 | category-106 | product-21f4 | business-de42 | location-1ba8 | 1 | 850.0 | 850.0 |
| 1 | August 2024 | category-120 | product-4156 | business-de42 | location-1ba8 | 2 | 1910.0 | 3820.0 |
| 2 | August 2024 | category-121 | product-49bd | business-de42 | location-1ba8 | 1 | 3670.0 | 3670.0 |
| 3 | August 2024 | category-76 | product-61dd | business-de42 | location-1ba8 | 1 | 2605.0 | 2605.0 |
| 4 | August 2024 | category-119 | product-66e0 | business-de42 | location-1ba8 | 5 | 1480.0 | 7400.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 333398 | November 2024 | category-121 | product-898d | business-0e99 | location-689f | 5 | 1360.0 | 6800.0 |
| 333401 | August 2024 | category-106 | product-21f4 | business-72bc | location-689f | 1 | 850.0 | 850.0 |
| 333402 | August 2024 | category-76 | product-6e9c | business-72bc | location-689f | 1 | 2575.0 | 2575.0 |
| 333403 | October 2024 | category-91 | product-523e | business-2693 | location-b27b | 1 | 1520.0 | 1520.0 |
| 333404 | October 2024 | category-75 | product-b31e | business-2693 | location-b27b | 1 | 4420.0 | 4420.0 |
329873 rows × 8 columns
Calculating the sales for each unique category
# Group by anonymized_category
category_sales = df3.groupby('anonymized_category').agg(
Total_Quantity=('quantity', 'sum'),
Total_Value=('Value', 'sum')
).reset_index()
category_sales
| | anonymized_category | Total_Quantity | Total_Value |
|---|---|---|---|
| 0 | category-100 | 76824 | 134902751.0 |
| 1 | category-101 | 19564 | 35577822.0 |
| 2 | category-102 | 1786 | 464463.0 |
| 3 | category-104 | 1217 | 1557598.0 |
| 4 | category-105 | 1579 | 2690719.0 |
| 5 | category-106 | 6521 | 5932763.0 |
| 6 | category-107 | 2729 | 4170797.0 |
| 7 | category-108 | 9756 | 5101375.0 |
| 8 | category-109 | 1446 | 1263226.0 |
| 9 | category-110 | 10529 | 5483386.0 |
| 10 | category-111 | 6715 | 4387343.0 |
| 11 | category-113 | 741 | 1254083.0 |
| 12 | category-114 | 3 | 8600.0 |
| 13 | category-115 | 348 | 425360.0 |
| 14 | category-116 | 856 | 422745.0 |
| 15 | category-117 | 5 | 1550.0 |
| 16 | category-118 | 21 | 7560.0 |
| 17 | category-119 | 68332 | 103454819.0 |
| 18 | category-120 | 169715 | 319178743.0 |
| 19 | category-121 | 14669 | 22327643.0 |
| 20 | category-122 | 1223 | 3493480.0 |
| 21 | category-123 | 286 | 730730.0 |
| 22 | category-124 | 4 | 10060.0 |
| 23 | category-125 | 123 | 297060.0 |
| 24 | category-74 | 941 | 1927871.0 |
| 25 | category-75 | 151330 | 544658700.0 |
| 26 | category-76 | 71719 | 344939553.0 |
| 27 | category-77 | 28455 | 76741382.0 |
| 28 | category-78 | 9766 | 9792609.0 |
| 29 | category-79 | 2215 | 1184953.0 |
| 30 | category-81 | 142 | 72061.0 |
| 31 | category-82 | 4759 | 3930818.0 |
| 32 | category-83 | 2436 | 4039483.0 |
| 33 | category-84 | 11933 | 6798158.0 |
| 34 | category-85 | 22997 | 33762533.0 |
| 35 | category-86 | 8 | 3320.0 |
| 36 | category-89 | 238 | 136850.0 |
| 37 | category-90 | 15 | 15750.0 |
| 38 | category-91 | 20853 | 44152103.0 |
| 39 | category-92 | 6953 | 10468723.0 |
| 40 | category-94 | 23668 | 16750815.0 |
| 41 | category-95 | 4116 | 7466932.0 |
| 42 | category-96 | 1427 | 2249424.0 |
| 43 | category-97 | 2711 | 2628309.0 |
| 44 | category-98 | 2152 | 2519695.0 |
| 45 | category-99 | 1964 | 1589480.0 |
# graphing total quantity and total value per anonymized category
import plotly.graph_objects as go
# bar chart for Total Quantity
bar = go.Bar(
x=category_sales["anonymized_category"],
y=category_sales["Total_Quantity"],
name="Total Quantity",
marker=dict(color="steelblue")
)
# line chart for Total Value
line = go.Scatter(
x=category_sales["anonymized_category"],
y=category_sales["Total_Value"],
name="Total Value",
mode="lines+markers", # Line with markers
line=dict(color="firebrick", width=3), # Line color and width
marker=dict(size=8), # Marker size
yaxis="y2" # Assign to secondary y-axis
)
# Combine both traces
fig = go.Figure(data=[bar, line])
# Update layout for dual y-axes
fig.update_layout(
title="Sales Overview by Category",
xaxis_title="Category",
yaxis=dict(
title="Total Quantity",
titlefont=dict(color="steelblue"), # Match bar chart color
tickfont=dict(color="steelblue") # Match bar chart color
),
yaxis2=dict(
title="Total Value",
titlefont=dict(color="firebrick"), # Match line chart color
tickfont=dict(color="firebrick"), # Match line chart color
overlaying="y", # Overlay on the same plot
side="right" # Position on the right
),
legend_title="Metrics",
barmode="group", # Group bars together
template="plotly_white" # Cleaner theme
)
# Show the figure
fig.show()
∴ From the combined bar-and-line plot, we can deduce that 'category-120' has the highest sales volume, with over 169.7k units sold.
However, by total value, 'category-75' leads with over 544.6M in revenue.
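The same leaders can be read off programmatically instead of from the chart. A sketch against the `category_sales` frame built above; `top_by` is a hypothetical helper:

```python
import pandas as pd

def top_by(frame: pd.DataFrame, key: str, metric: str):
    """Return the label in `key` whose row has the largest `metric` value."""
    return frame.loc[frame[metric].idxmax(), key]

# e.g. top_by(category_sales, 'anonymized_category', 'Total_Quantity')
# and  top_by(category_sales, 'anonymized_category', 'Total_Value')
```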
Calculating the sales for each unique business.
# Group by anonymized_business
business_sales = df3.groupby('anonymized_business').agg(
Total_Quantity=('quantity', 'sum'),
Total_Value=('Value', 'sum')
).reset_index()
business_sales
| | anonymized_business | Total_Quantity | Total_Value |
|---|---|---|---|
| 0 | business-0000 | 8 | 10445.0 |
| 1 | business-0005 | 1 | 2645.0 |
| 2 | business-0029 | 26 | 77340.0 |
| 3 | business-003d | 98 | 221761.0 |
| 4 | business-0072 | 127 | 225056.0 |
| ... | ... | ... | ... |
| 4795 | business-ffa9 | 3 | 6740.0 |
| 4796 | business-ffae | 6 | 10530.0 |
| 4797 | business-ffb1 | 266 | 438115.0 |
| 4798 | business-ffd2 | 37 | 67723.0 |
| 4799 | business-ffff | 110 | 110285.0 |
4800 rows × 3 columns
For the per-business comparison, two separate charts are appropriate because of the large number of businesses plotted.
# Line chart for Total Value (green)
# go.Line is deprecated, so use go.Scatter with mode="lines"
line_value = go.Scatter(
    x=business_sales["anonymized_business"],
    y=business_sales["Total_Value"],
    name="Total Value",
    mode="lines",
    line=dict(color="green"),
)
# layout
layout_value = go.Layout(
    title="Total Value by Business",
    xaxis_title="Business",
    yaxis_title="Total Value",
    template="plotly_white",
)
# figure
fig_value = go.Figure(data=[line_value], layout=layout_value)
# Show the plot
fig_value.show()
∴ From the line plot, we can deduce that 'business-978e' has the highest sales value, with over 28M in revenue.
# Line chart for Total Quantity (blue)
line_quantity = go.Scatter(
    x=business_sales["anonymized_business"],
    y=business_sales["Total_Quantity"],
    name="Total Quantity",
    mode="lines",
    line=dict(color="blue"),
)
# layout
layout_quantity = go.Layout(
    title="Total Quantity by Business",
    xaxis_title="Business",
    yaxis_title="Total Quantity",
    template="plotly_white",
)
# figure
fig_quantity = go.Figure(data=[line_quantity], layout=layout_quantity)
# Show the plot
fig_quantity.show()
∴ From the line plot, we can deduce that 'business-978e' also has the highest sales volume, with over 13.9k units sold.
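With 4,800 businesses, `nlargest` is handier than eyeballing a line plot. A sketch against the `business_sales` frame above; `top_n_businesses` is a hypothetical helper:

```python
import pandas as pd

def top_n_businesses(frame: pd.DataFrame, metric: str, n: int = 5) -> pd.DataFrame:
    """Top-n businesses by the given metric ('Total_Value' or 'Total_Quantity')."""
    return frame.nlargest(n, metric)[['anonymized_business', metric]]
```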
Trends Over Time¶
To create a time-series plot, we need a datetime variable.
# Create a copy of df
df4 = df.copy()
# Add 'Value' column
df4['value'] = df4['quantity'] * df4['unit_price']
# Create a new column 'Month-Year' in df4 (not df)
df4['month-year'] = df4['date'].dt.to_period('M').dt.to_timestamp()
# Format 'Month-Year' to display as "August 2024"
df4['month-year-formatted'] = df4['month-year'].dt.strftime('%B %Y')
# Reorder columns to make 'Month-Year-Formatted' the first column
df4 = df4[['month-year-formatted'] + [col for col in df4.columns if col != 'month-year-formatted']]
del df4['date']
df4.info()
<class 'pandas.core.frame.DataFrame'>
Index: 329873 entries, 0 to 333404
Data columns (total 9 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   month-year-formatted  329873 non-null  object
 1   anonymized_category   329873 non-null  object
 2   anonymized_product    329873 non-null  object
 3   anonymized_business   329873 non-null  object
 4   anonymized_location   329873 non-null  object
 5   quantity              329873 non-null  int64
 6   unit_price            329873 non-null  float64
 7   value                 329873 non-null  float64
 8   month-year            329873 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(1), object(5)
memory usage: 25.2+ MB
Our datetime column ('month-year') is floored to the first of the month for aggregation; for instance, any date in January 2024 is recorded as Jan 1, 2024.
# Group the data by 'Month-Year' and calculate the monthly average of 'Quantity' and 'Value'
monthly_avg_sales = df4.groupby('month-year').agg({'quantity': 'mean', 'value': 'mean'}).reset_index()
# Convert 'Month-Year' to a string for plotting purposes
monthly_avg_sales['month-year'] = monthly_avg_sales['month-year'].astype(str)
# Create a line plot with dual y-axes
fig = go.Figure()
# Add the line for Value
fig.add_trace(go.Scatter(
x=monthly_avg_sales['month-year'],
y=monthly_avg_sales['value'],
name='Average Value',
line=dict(color='red'),
yaxis='y1'
))
# Add the line for Quantity
fig.add_trace(go.Scatter(
x=monthly_avg_sales['month-year'],
y=monthly_avg_sales['quantity'],
name='Average Quantity',
line=dict(color='green'),
yaxis='y2'
))
# Update layout for dual y-axes
fig.update_layout(
title="Monthly Averages of Sales (Quantity and Value)",
xaxis=dict(title="Month-Year"),
yaxis=dict(
    title="Average Value",
    titlefont=dict(color="red"),  # Match the red Value line
    tickfont=dict(color="red"),
),
yaxis2=dict(
title="Average Quantity",
titlefont=dict(color="green"),
tickfont=dict(color="green"),
anchor="x",
overlaying="y",
side="right",
),
legend=dict(x=0.5, y=1.1, orientation="h"),
)
# Show the plot
fig.show()
Monthly averages are useful for understanding typical performance per month, especially when there are outliers or variations in the data.
For the monthly average sales (both quantity and value), the lines slope downward, indicating a decreasing trend over time.
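The downward slope can be quantified with a simple least-squares fit over the month index. A sketch, assuming `monthly_avg_sales` as built above; `monthly_trend_slope` is a hypothetical helper:

```python
import numpy as np
import pandas as pd

def monthly_trend_slope(monthly: pd.DataFrame, col: str) -> float:
    """Slope of a degree-1 polynomial fit of `col` against the month index;
    a negative slope indicates a declining trend."""
    x = np.arange(len(monthly))
    slope, _intercept = np.polyfit(x, monthly[col].to_numpy(), 1)
    return float(slope)
```

A two-point trend is of course crude; it is a sanity check on the visual impression, not a forecast.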
# Group the data by 'Month-Year' and calculate the monthly sum of 'Quantity' and 'Value'
monthly_sum_sales = df4.groupby('month-year').agg({'quantity': 'sum', 'value': 'sum'}).reset_index()
# Convert 'Month-Year' to a string for plotting purposes
monthly_sum_sales['month-year'] = monthly_sum_sales['month-year'].astype(str)
# Create a line plot with dual y-axes
fig = go.Figure()
# Add the line for Value
fig.add_trace(go.Scatter(
x=monthly_sum_sales['month-year'],
y=monthly_sum_sales['value'],
name='Total Sum Value',
line=dict(color='red'),
yaxis='y1'
))
# Add the line for Quantity
fig.add_trace(go.Scatter(
x=monthly_sum_sales['month-year'],
y=monthly_sum_sales['quantity'],
name='Total Sum Quantity',
line=dict(color='green'),
yaxis='y2'
))
# Update layout for dual y-axes
fig.update_layout(
title="Monthly Sum of Sales (Quantity and Value)",
xaxis=dict(title="Month-Year"),
yaxis=dict(
    title="Total Value",
    titlefont=dict(color="red"),  # Match the red Value line
    tickfont=dict(color="red"),
),
yaxis2=dict(
title="Total Quantity",
titlefont=dict(color="green"),
tickfont=dict(color="green"),
anchor="x",
overlaying="y",
side="right",
),
legend=dict(x=0.5, y=1.1, orientation="h"),
)
# Show the plot
fig.show()
To understand overall monthly performance, we aggregate the data by summing the sales for each month.
The trend shows visible fluctuations, with peaks in certain months (May, July and October), which might suggest seasonal effects.
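Those peaks stand out more clearly as month-over-month percentage changes. A sketch against `monthly_sum_sales`; `mom_pct_change` is a hypothetical helper:

```python
import pandas as pd

def mom_pct_change(monthly: pd.DataFrame, col: str) -> pd.Series:
    """Month-over-month percentage change of `col`; the first month is NaN.
    Large positive spikes hint at seasonal peaks."""
    return monthly[col].pct_change().mul(100).round(1)
```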
Performance analysis¶
# top 5 most frequently purchased products (based on Quantity).
# Group by product and sum the quantities
top_products = df4.groupby('anonymized_product')['quantity'].sum().reset_index()
# Sort in descending order and select the top 5
top_5_products = top_products.sort_values(by='quantity', ascending=False).head(5)
# Display the result
top_5_products
| | anonymized_product | quantity |
|---|---|---|
| 338 | product-66e0 | 46957 |
| 750 | product-e805 | 42602 |
| 476 | product-8f75 | 37566 |
| 127 | product-29ee | 35940 |
| 213 | product-4156 | 28487 |
# Create a bar graph
fig = px.bar(
top_5_products,
x='anonymized_product',
y='quantity',
title='Top 5 Most Purchased Products (Quantity)',
labels={'anonymized_product': 'Product', 'quantity': 'Total Quantity'},
text='quantity' # Display the quantity on the bars
)
# Customize the layout
fig.update_layout(
xaxis_title='Product',
yaxis_title='Total Quantity',
template='plotly_white'
)
# Show the plot
fig.show()
# top 5 most valuable products (based on Value).
# Group by product and sum the values
top_value_products = df4.groupby('anonymized_product')['value'].sum().reset_index()
# Sort in descending order and select the top 5
top_value_products = top_value_products.sort_values(by='value', ascending=False).head(5)
# Display the result
top_value_products
| | anonymized_product | value |
|---|---|---|
| 750 | product-e805 | 262787281.0 |
| 476 | product-8f75 | 158797460.0 |
| 338 | product-66e0 | 70704225.0 |
| 127 | product-29ee | 68248274.0 |
| 213 | product-4156 | 56956007.0 |
# Create a bar graph
fig = px.bar(
top_value_products,
x='anonymized_product',
y='value',
title='Top 5 Most Valuable Products (Value)',
labels={'anonymized_product': 'Product', 'value': 'Total Value'},
text='value'  # Display the value on the bars
)
# Customize the layout
fig.update_layout(
xaxis_title='Product',
yaxis_title='Total Value',
template='plotly_white'
)
# Show the plot
fig.show()
Advanced Analysis¶
Perform a segmentation analysis of businesses (Anonymized Business) based on their purchasing behavior:¶
df4
| | month-year-formatted | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price | value | month-year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | August 2024 | category-106 | product-21f4 | business-de42 | location-1ba8 | 1 | 850.0 | 850.0 | 2024-08-01 |
| 1 | August 2024 | category-120 | product-4156 | business-de42 | location-1ba8 | 2 | 1910.0 | 3820.0 | 2024-08-01 |
| 2 | August 2024 | category-121 | product-49bd | business-de42 | location-1ba8 | 1 | 3670.0 | 3670.0 | 2024-08-01 |
| 3 | August 2024 | category-76 | product-61dd | business-de42 | location-1ba8 | 1 | 2605.0 | 2605.0 | 2024-08-01 |
| 4 | August 2024 | category-119 | product-66e0 | business-de42 | location-1ba8 | 5 | 1480.0 | 7400.0 | 2024-08-01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 333398 | November 2024 | category-121 | product-898d | business-0e99 | location-689f | 5 | 1360.0 | 6800.0 | 2024-11-01 |
| 333401 | August 2024 | category-106 | product-21f4 | business-72bc | location-689f | 1 | 850.0 | 850.0 | 2024-08-01 |
| 333402 | August 2024 | category-76 | product-6e9c | business-72bc | location-689f | 1 | 2575.0 | 2575.0 | 2024-08-01 |
| 333403 | October 2024 | category-91 | product-523e | business-2693 | location-b27b | 1 | 1520.0 | 1520.0 | 2024-10-01 |
| 333404 | October 2024 | category-75 | product-b31e | business-2693 | location-b27b | 1 | 4420.0 | 4420.0 | 2024-10-01 |
329873 rows × 9 columns
# Group by Anonymized Business and calculate required metrics
business_segmentation = df4.groupby('anonymized_business').agg(
total_quantity=('quantity', 'sum'),
total_value=('value', 'sum'),
transaction_frequency=('anonymized_business', 'count')
).reset_index()
# Sorting for analysis (optional)
top_businesses = business_segmentation.sort_values(by='total_value', ascending=False)
top_businesses
| anonymized_business | total_quantity | total_value | transaction_frequency | |
|---|---|---|---|---|
| 2906 | business-978e | 13991 | 28037358.0 | 2614 |
| 4781 | business-fe7d | 6743 | 26997121.0 | 653 |
| 1873 | business-6068 | 8214 | 16464195.0 | 1474 |
| 144 | business-07de | 6065 | 16258068.0 | 1160 |
| 2353 | business-7a03 | 6318 | 13968451.0 | 1147 |
| ... | ... | ... | ... | ... |
| 2616 | business-8654 | 1 | 55.0 | 1 |
| 3666 | business-c19b | 1 | 55.0 | 1 |
| 1478 | business-4b87 | 5 | 10.0 | 1 |
| 4060 | business-d78d | 1 | 2.0 | 1 |
| 4619 | business-f712 | 22 | 0.0 | 1 |
4800 rows × 4 columns
- total_quantity: sums the quantity column to get the total units purchased by each business.
- total_value: sums the value column to get the total revenue contributed by each business.
- transaction_frequency: counts the rows for each business to measure how often it transacts.
# scatter plot to analyze relationships between total quantity, total value, and transaction frequency
fig = px.scatter(
business_segmentation,
x='total_quantity',
y='total_value',
size='transaction_frequency',
color='transaction_frequency',
title='Customer Segmentation Based on Purchasing Behavior',
labels={
'total_quantity': 'Total Quantity Purchased',
'total_value': 'Total Value Contributed ($)',
'transaction_frequency': 'Transaction Frequency'
},
hover_data=['anonymized_business']
)
# Show the plot
fig.show()
From the scatter plot we can identify high-value businesses: those with high total value and frequent transactions.
These sit in the top-right quadrant of the graph, purchasing larger quantities and contributing more value.
Our top high-value business is business-978e, with the highest total quantity purchased, the highest value contributed, and the highest transaction frequency.
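The quadrant reading above can also be done programmatically. A minimal sketch, using a hypothetical frame that mirrors business_segmentation's schema (the 75th-percentile cutoffs are an assumption, not the notebook's rule):

```python
import pandas as pd

# Hypothetical segmentation frame mirroring business_segmentation's columns
seg = pd.DataFrame({
    'anonymized_business': ['b-1', 'b-2', 'b-3', 'b-4'],
    'total_quantity': [13991, 6743, 120, 5],
    'total_value': [28_037_358.0, 26_997_121.0, 250_000.0, 2_645.0],
    'transaction_frequency': [2614, 653, 101, 1],
})

# "Top-right quadrant": above the 75th percentile on both axes
q_qty = seg['total_quantity'].quantile(0.75)
q_val = seg['total_value'].quantile(0.75)
high_value = seg[(seg['total_quantity'] > q_qty) & (seg['total_value'] > q_val)]
print(high_value['anonymized_business'].tolist())
```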
Classifying businesses into 3 groups (e.g., High Value, Medium Value, Low Value)¶
business_segmentation
| anonymized_business | total_quantity | total_value | transaction_frequency | |
|---|---|---|---|---|
| 0 | business-0000 | 8 | 10445.0 | 8 |
| 1 | business-0005 | 1 | 2645.0 | 1 |
| 2 | business-0029 | 26 | 77340.0 | 6 |
| 3 | business-003d | 98 | 221761.0 | 31 |
| 4 | business-0072 | 127 | 225056.0 | 101 |
| ... | ... | ... | ... | ... |
| 4795 | business-ffa9 | 3 | 6740.0 | 3 |
| 4796 | business-ffae | 6 | 10530.0 | 5 |
| 4797 | business-ffb1 | 266 | 438115.0 | 105 |
| 4798 | business-ffd2 | 37 | 67723.0 | 22 |
| 4799 | business-ffff | 110 | 110285.0 | 107 |
4800 rows × 4 columns
# Inspect the distribution of total_value to classify the businesses
fig = px.box(business_segmentation, y='total_value')
# Show the figure
fig.show()
# Calculate quartiles
Q1 = business_segmentation['total_value'].quantile(0.25)
Q2 = business_segmentation['total_value'].quantile(0.50)
Q3 = business_segmentation['total_value'].quantile(0.75)
# Compute IQR and upper whisker (excluding extreme outliers)
IQR = Q3 - Q1
upper_whisker = Q3 + 1.5 * IQR
# Define function to categorize values into quartiles
def categorize(value):
if value <= Q1:
return "0 - Q1"
elif Q1 < value <= Q2:
return "Q1 - Q2"
elif Q2 < value <= Q3:
return "Q2 - Q3"
elif Q3 < value <= upper_whisker:
return "Q3 - Q4"
else:
return "Outlier"
# Apply function to categorize values
business_segmentation['quartile_category'] = business_segmentation['total_value'].apply(categorize)
# Ensure quartile categories are ordered
quartile_order = ["0 - Q1", "Q1 - Q2", "Q2 - Q3", "Q3 - Q4", "Outlier"]
# Convert to categorical with ordered=True
business_segmentation['quartile_category'] = pd.Categorical(
business_segmentation['quartile_category'],
categories=quartile_order,
ordered=True
)
# Count occurrences, now sorted correctly
quartile_counts = business_segmentation['quartile_category'].value_counts().sort_index()
# Display results
print("Count of values in each quartile:\n")
print(quartile_counts)
Count of values in each quartile:

quartile_category
0 - Q1     1200
Q1 - Q2    1200
Q2 - Q3    1200
Q3 - Q4     526
Outlier     674
Name: count, dtype: int64
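As a side note, the manual categorize function above could arguably be replaced by pd.cut with explicit bin edges, which returns an ordered categorical directly. A sketch with hypothetical values (not the dataset's actual figures):

```python
import pandas as pd
import numpy as np

# Hypothetical total_value figures for illustration
values = pd.Series([100.0, 10_000.0, 50_000.0, 300_000.0, 1_000_000.0])
Q1, Q2, Q3 = values.quantile([0.25, 0.50, 0.75])
upper_whisker = Q3 + 1.5 * (Q3 - Q1)

# pd.cut bins are right-inclusive, matching the "value <= Q1" style above
bins = [-np.inf, Q1, Q2, Q3, upper_whisker, np.inf]
labels = ["0 - Q1", "Q1 - Q2", "Q2 - Q3", "Q3 - Q4", "Outlier"]
cats = pd.cut(values, bins=bins, labels=labels)  # ordered categorical by construction
print(cats.tolist())
```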
# Get the minimum and maximum values
min_value = business_segmentation['total_value'].min()
max_value = business_segmentation['total_value'].max()
# Display results
print(f"Minimum Value: {min_value}")
print(f"First Quartile (Q1): {Q1}")
print(f"Second Quartile (Median/Q2): {Q2}")
print(f"Third Quartile (Q3): {Q3}")
print(f"Fourth Quartile (Upper Whisker): {upper_whisker}")
print(f"Maximum Value: {max_value}")
Minimum Value: 0.0
First Quartile (Q1): 9207.5
Second Quartile (Median/Q2): 41869.0
Third Quartile (Q3): 234042.0
Fourth Quartile (Upper Whisker): 571293.75
Maximum Value: 28037358.0
Based on the distribution of total_value, businesses are classified into three groups: high (above Q3), medium (between Q2 and Q3), and low (at or below Q2).
# Define function to classify businesses based on their total_value
def classify_business(value):
if value > Q3:
return "high"
elif Q2 < value <= Q3:
return "medium"
else:
return "low"
# Apply the classification function to the dataframe
business_segmentation['value_category'] = business_segmentation['total_value'].apply(classify_business)
business_segmentation
| anonymized_business | total_quantity | total_value | transaction_frequency | quartile_category | value_category | |
|---|---|---|---|---|---|---|
| 0 | business-0000 | 8 | 10445.0 | 8 | Q1 - Q2 | low |
| 1 | business-0005 | 1 | 2645.0 | 1 | 0 - Q1 | low |
| 2 | business-0029 | 26 | 77340.0 | 6 | Q2 - Q3 | medium |
| 3 | business-003d | 98 | 221761.0 | 31 | Q2 - Q3 | medium |
| 4 | business-0072 | 127 | 225056.0 | 101 | Q2 - Q3 | medium |
| ... | ... | ... | ... | ... | ... | ... |
| 4795 | business-ffa9 | 3 | 6740.0 | 3 | 0 - Q1 | low |
| 4796 | business-ffae | 6 | 10530.0 | 5 | Q1 - Q2 | low |
| 4797 | business-ffb1 | 266 | 438115.0 | 105 | Q3 - Q4 | high |
| 4798 | business-ffd2 | 37 | 67723.0 | 22 | Q2 - Q3 | medium |
| 4799 | business-ffff | 110 | 110285.0 | 107 | Q2 - Q3 | medium |
4800 rows × 6 columns
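To see how many businesses land in each tier, a quick value_counts on the classification works. A sketch with hypothetical total_value figures (the Q2/Q3 rule mirrors classify_business above):

```python
import pandas as pd

# Hypothetical total_value figures; thresholds mirror the Q2/Q3 rule used above
totals = pd.Series([2_645.0, 10_445.0, 77_340.0, 221_761.0, 438_115.0, 28_037_358.0])
Q2, Q3 = totals.quantile(0.50), totals.quantile(0.75)

# Same three-way classification as classify_business()
tiers = totals.apply(lambda v: 'high' if v > Q3 else ('medium' if v > Q2 else 'low'))
print(tiers.value_counts().to_dict())
```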
Business Value Segmentation & Engagement Strategies¶
1. High-Value Businesses (Q3 - Q4 & Outliers) 💰
Businesses that contribute the highest total value.
📌 Recommendations:
- Loyalty Programs: Offer discounts, rebates, or exclusive deals for bulk purchases.
- Early Access: Provide early access to new products or priority stock allocation.
- Co-Marketing Initiatives: Partner for promotions, sponsorships, or cross-selling opportunities.
2. Medium-Value Businesses (Q2 - Q3) ⚖️ Businesses with moderate purchases and potential for growth.
📌 Recommendations:
- Upselling & Cross-Selling: Introduce relevant product bundles or complementary items.
- Flexible Payment Terms: Offer customized payment plans to increase order volume.
- Marketing Support: Provide digital assets, promotional materials, or co-branded advertisements.
- Incentivized Growth Plans: Introduce tiered rewards—higher purchases unlock better benefits.
3. Low-Value Businesses (0 - Q2) 📉 Businesses with minimal engagement or small purchases.
📌 Recommendations:
- Special Promotions: Offer introductory discounts or limited-time deals to encourage higher orders.
- Automated Follow-Ups: Use email campaigns to remind them of product benefits or new stock.
- Survey & Feedback: Understand why their purchases are low and identify pain points.
- Self-Service Support: Provide a digital knowledge base or chatbot assistance to answer queries.
- Referral Incentives: Encourage referrals with discounts or bonuses for bringing in new customers.
Forecasting¶
# forecasting with arima
import plotly.graph_objects as go
from statsmodels.tsa.arima.model import ARIMA
# Ensure the Month-Year column is in datetime format
df4['month-year'] = pd.to_datetime(df4['month-year'])
# Aggregate total sales by month
monthly_sales = df4.groupby('month-year')['value'].sum()
# Fit an ARIMA model (order can be optimized based on ACF/PACF)
model = ARIMA(monthly_sales, order=(2,1,2))
model_fit = model.fit()
# Forecast for the next 3 months
forecast = model_fit.forecast(steps=3)
# Generate future date range ('ME' = month-end; the plain 'M' alias is deprecated)
future_dates = pd.date_range(start=monthly_sales.index[-1], periods=4, freq='ME')[1:]
# Convert dates to "Month-Year" format (e.g., "January 2024")
formatted_dates = future_dates.strftime('%B %Y')
# Create Plotly figure
fig = go.Figure()
# Add historical sales data
fig.add_trace(go.Scatter(
x=monthly_sales.index.strftime('%B %Y'), # Format historical x-axis labels
y=monthly_sales.values,
mode='lines',
name='Historical Sales',
line=dict(color='blue')
))
# Add forecasted sales data
fig.add_trace(go.Scatter(
x=formatted_dates,
y=forecast.values,
mode='lines+markers',
name='Forecasted Sales',
line=dict(color='red', dash='dash')
))
# Update layout
fig.update_layout(
title="3-Month Sales Forecast",
xaxis_title="Month-Year",
yaxis_title="Total Sales (Value)",
xaxis=dict(tickmode='array', tickvals=monthly_sales.index.strftime('%B %Y')),
template="plotly_white"
)
# Show the plot
fig.show()
c:\Program Files\Python312\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
c:\Program Files\Python312\Lib\site-packages\statsmodels\tsa\statespace\sarimax.py:966: UserWarning: Non-stationary starting autoregressive parameters found. Using zeros as starting parameters.
c:\Program Files\Python312\Lib\site-packages\statsmodels\tsa\statespace\sarimax.py:978: UserWarning: Non-invertible starting MA parameters found. Using zeros as starting parameters.
Forecasted Sales (Red Dashed Line)
The forecast starts high in January 2025 and shows a gradual decline through March 2025. This suggests that the model predicts a downward trend in sales over the forecast period.
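Before trusting the ARIMA line too much, it can help to compare it against a simple drift baseline (extend the average month-over-month change forward). A pandas-only sketch with hypothetical monthly totals standing in for monthly_sales:

```python
import pandas as pd

# Hypothetical monthly totals standing in for monthly_sales
monthly = pd.Series(
    [190.0, 120.0, 80.0, 110.0, 150.0, 140.0],
    index=pd.date_range('2024-01-01', periods=6, freq='MS'),
)

# Drift baseline: extend the average month-over-month change from the last value
drift = (monthly.iloc[-1] - monthly.iloc[0]) / (len(monthly) - 1)
future_idx = pd.date_range(monthly.index[-1], periods=4, freq='MS')[1:]
steps = pd.Series(range(1, 4), index=future_idx)
baseline = monthly.iloc[-1] + drift * steps
print(baseline.tolist())
```

If the ARIMA forecast diverges wildly from such a baseline, that is a cue to revisit the (2,1,2) order.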
Anomaly Detection¶
# Identify any unusual spikes or drops in sales performance (Quantity or Value)
# Group by 'month-year' and compute the monthly totals
monthly_sum_sales = df4.groupby('month-year').agg({'quantity': 'sum', 'value': 'sum'}).reset_index()
# Convert 'month-year' to a string for plotting
monthly_sum_sales['month-year'] = monthly_sum_sales['month-year'].astype(str)
# Create a line plot with dual y-axes
fig = go.Figure()
# Add the line for Value
fig.add_trace(go.Scatter(
x=monthly_sum_sales['month-year'],
y=monthly_sum_sales['value'],
name='Total Sum Value',
line=dict(color='red'),
yaxis='y1'
))
# Add the line for Quantity
fig.add_trace(go.Scatter(
x=monthly_sum_sales['month-year'],
y=monthly_sum_sales['quantity'],
name='Total Sum Quantity',
line=dict(color='green'),
yaxis='y2'
))
# Update layout for dual y-axes (axis colors match the trace colors)
fig.update_layout(
    title="Monthly Sum of Sales (Quantity and Value)",
    xaxis=dict(title="Month-Year"),
    yaxis=dict(
        title=dict(text="Total Value", font=dict(color="red")),
        tickfont=dict(color="red"),
    ),
    yaxis2=dict(
        title=dict(text="Total Quantity", font=dict(color="green")),
        tickfont=dict(color="green"),
        anchor="x",
        overlaying="y",
        side="right",
    ),
    legend=dict(x=0.5, y=1.1, orientation="h"),
)
# Show the plot
fig.show()
From the graph, here are the notable spikes and drops in sales performance:
- January 2024 (Spike in Total Sum Value and Quantity)
- Sales value peaked at around 190M, and quantity was also relatively high.
- Possible reason: Seasonal demand, holiday sales, or a major promotional event.
- February to March 2024 (Steep Drop in Both Value and Quantity)
- A sharp decline in both total value and total quantity, reaching the lowest point in March.
- Possible reason: Post-holiday season slowdown, reduced consumer spending, or stock shortages.
- May 2024 (Spike in Both Value and Quantity)
- Significant recovery from March, with a sharp increase in sales.
- Possible reason: A new product launch, marketing campaign, or seasonal sales trend.
- July 2024 (Another Peak in Value and Quantity)
- Both metrics reached one of their highest points in the year.
- Possible reason: Mid-year sales event, discounts, or increased demand in specific product categories.
- November 2024 (Sharp Drop in Value, Moderate Drop in Quantity)
- The quantity remains relatively high compared to previous dips, but the value drops significantly.
- Possible reason: Discounted sales leading to higher volumes but lower revenue per unit.
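The spikes and drops above were read off the chart visually; a simple z-score filter can flag them programmatically. A sketch with hypothetical monthly totals (the 1.5-sigma threshold is an assumption, tune as needed):

```python
import pandas as pd

# Hypothetical monthly 'value' totals standing in for monthly_sum_sales['value']
vals = pd.Series([100, 105, 95, 100, 200, 100, 98, 102, 30, 100], dtype=float)

# Flag months deviating more than 1.5 standard deviations from the mean
z = (vals - vals.mean()) / vals.std()
anomalies = vals[z.abs() > 1.5]
print(anomalies.index.tolist())
```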
Correlation Analysis¶
df4
| month-year-formatted | anonymized_category | anonymized_product | anonymized_business | anonymized_location | quantity | unit_price | value | month-year | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | August 2024 | category-106 | product-21f4 | business-de42 | location-1ba8 | 1 | 850.0 | 850.0 | 2024-08-01 |
| 1 | August 2024 | category-120 | product-4156 | business-de42 | location-1ba8 | 2 | 1910.0 | 3820.0 | 2024-08-01 |
| 2 | August 2024 | category-121 | product-49bd | business-de42 | location-1ba8 | 1 | 3670.0 | 3670.0 | 2024-08-01 |
| 3 | August 2024 | category-76 | product-61dd | business-de42 | location-1ba8 | 1 | 2605.0 | 2605.0 | 2024-08-01 |
| 4 | August 2024 | category-119 | product-66e0 | business-de42 | location-1ba8 | 5 | 1480.0 | 7400.0 | 2024-08-01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 333398 | November 2024 | category-121 | product-898d | business-0e99 | location-689f | 5 | 1360.0 | 6800.0 | 2024-11-01 |
| 333401 | August 2024 | category-106 | product-21f4 | business-72bc | location-689f | 1 | 850.0 | 850.0 | 2024-08-01 |
| 333402 | August 2024 | category-76 | product-6e9c | business-72bc | location-689f | 1 | 2575.0 | 2575.0 | 2024-08-01 |
| 333403 | October 2024 | category-91 | product-523e | business-2693 | location-b27b | 1 | 1520.0 | 1520.0 | 2024-10-01 |
| 333404 | October 2024 | category-75 | product-b31e | business-2693 | location-b27b | 1 | 4420.0 | 4420.0 | 2024-10-01 |
329873 rows × 9 columns
# Correlation matrix between numerical columns
X = df4[['quantity', 'unit_price', 'value']]
# Calculate the correlation matrix
correlation_matrix = X.corr()
# Create a heatmap using seaborn
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)
# Add a title
plt.title("Correlation Heatmap of Selected Features", fontsize=16)
# Show the heatmap
plt.show()
Sales Performance is Primarily Driven by Volume: The strong correlation between quantity and value suggests selling more units is the best way to increase revenue.
Pricing has a Limited Effect on Revenue: Since unit price has a weak correlation with total value, raising prices may not significantly boost overall sales revenue.
Marketing Strategies Should Focus on Increasing Quantity Sold:
- Offering bulk discounts or promotions could drive more sales.
- Targeting high-demand products with aggressive sales campaigns may be more effective than price increases.
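It is worth noting that value is derived as quantity × unit_price, so the strong quantity-value correlation is partly mechanical rather than purely behavioral. A quick check using three rows from the head of the dataset:

```python
import pandas as pd

# Three rows matching the dataset's schema (figures from the table above)
rows = pd.DataFrame({'quantity': [1, 2, 5], 'unit_price': [850.0, 1910.0, 1480.0]})
rows['value'] = rows['quantity'] * rows['unit_price']
print(rows['value'].tolist())
```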
Strategic Insights and Recommendations¶
Product Strategy¶
# Group by Anonymized Category and calculate required metrics
category_segmentation = df4.groupby('anonymized_category').agg(
total_quantity=('quantity', 'sum'),
total_value=('value', 'sum'),
transaction_frequency=('anonymized_category', 'count')
).reset_index()
# Sorting for analysis
top_categories = category_segmentation.sort_values(by='total_value', ascending=False)
top_categories
| anonymized_category | total_quantity | total_value | transaction_frequency | |
|---|---|---|---|---|
| 25 | category-75 | 151330 | 544658700.0 | 47769 |
| 26 | category-76 | 71719 | 344939553.0 | 41473 |
| 18 | category-120 | 169715 | 319178743.0 | 61076 |
| 0 | category-100 | 76824 | 134902751.0 | 18720 |
| 17 | category-119 | 68332 | 103454819.0 | 31634 |
| 27 | category-77 | 28455 | 76741382.0 | 13586 |
| 38 | category-91 | 20853 | 44152103.0 | 15891 |
| 1 | category-101 | 19564 | 35577822.0 | 6636 |
| 34 | category-85 | 22997 | 33762533.0 | 13015 |
| 19 | category-121 | 14669 | 22327643.0 | 7281 |
| 40 | category-94 | 23668 | 16750815.0 | 10859 |
| 39 | category-92 | 6953 | 10468723.0 | 4977 |
| 28 | category-78 | 9766 | 9792609.0 | 6689 |
| 41 | category-95 | 4116 | 7466932.0 | 3106 |
| 33 | category-84 | 11933 | 6798158.0 | 8628 |
| 5 | category-106 | 6521 | 5932763.0 | 4919 |
| 9 | category-110 | 10529 | 5483386.0 | 6064 |
| 7 | category-108 | 9756 | 5101375.0 | 2534 |
| 10 | category-111 | 6715 | 4387343.0 | 5090 |
| 6 | category-107 | 2729 | 4170797.0 | 2107 |
| 32 | category-83 | 2436 | 4039483.0 | 1849 |
| 31 | category-82 | 4759 | 3930818.0 | 2503 |
| 20 | category-122 | 1223 | 3493480.0 | 567 |
| 4 | category-105 | 1579 | 2690719.0 | 1186 |
| 43 | category-97 | 2711 | 2628309.0 | 1597 |
| 44 | category-98 | 2152 | 2519695.0 | 1569 |
| 42 | category-96 | 1427 | 2249424.0 | 1041 |
| 24 | category-74 | 941 | 1927871.0 | 158 |
| 45 | category-99 | 1964 | 1589480.0 | 1336 |
| 3 | category-104 | 1217 | 1557598.0 | 999 |
| 8 | category-109 | 1446 | 1263226.0 | 947 |
| 11 | category-113 | 741 | 1254083.0 | 308 |
| 29 | category-79 | 2215 | 1184953.0 | 1313 |
| 21 | category-123 | 286 | 730730.0 | 177 |
| 2 | category-102 | 1786 | 464463.0 | 1036 |
| 13 | category-115 | 348 | 425360.0 | 242 |
| 14 | category-116 | 856 | 422745.0 | 618 |
| 23 | category-125 | 123 | 297060.0 | 63 |
| 36 | category-89 | 238 | 136850.0 | 176 |
| 30 | category-81 | 142 | 72061.0 | 87 |
| 37 | category-90 | 15 | 15750.0 | 15 |
| 22 | category-124 | 4 | 10060.0 | 4 |
| 12 | category-114 | 3 | 8600.0 | 3 |
| 16 | category-118 | 21 | 7560.0 | 14 |
| 35 | category-86 | 8 | 3320.0 | 7 |
| 15 | category-117 | 5 | 1550.0 | 4 |
# scatter plot to analyze relationships between total quantity, total value, and transaction frequency
fig = px.scatter(
category_segmentation,
x='total_quantity',
y='total_value',
size='transaction_frequency',
color='transaction_frequency',
title='Category Segmentation Based on Purchasing Behavior',
labels={
'total_quantity': 'Total Quantity Purchased',
'total_value': 'Total Value Contributed ($)',
'transaction_frequency': 'Transaction Frequency'
},
hover_data=['anonymized_category']
)
# Show the plot
fig.show()
Based on my analysis, I would recommend prioritizing category-75 for marketing campaigns: it has the highest total value contributed, at around 544.6M, from roughly 151.3K units purchased.
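The category-75 recommendation can also be framed as a concentration (Pareto) view: how much of total value the top categories account for. A sketch on just the top four category totals from the table above (a subset, for illustration only):

```python
import pandas as pd

# Top four category totals from the table above (a subset, for illustration)
cat_value = pd.Series({
    'category-75': 544_658_700.0,
    'category-76': 344_939_553.0,
    'category-120': 319_178_743.0,
    'category-100': 134_902_751.0,
}).sort_values(ascending=False)

# Cumulative share of value: how concentrated revenue is among these categories
cum_share = cat_value.cumsum() / cat_value.sum()
print(cum_share.round(2).to_dict())
```

Within this subset, category-75 alone carries roughly 40% of the value, which supports prioritizing it.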
Customer Retention¶
# Identify businesses that have reduced their purchase frequency over time
# Parse with an explicit format to avoid slow element-wise inference
df4['month-year-formatted'] = pd.to_datetime(df4['month-year-formatted'], format='%B %Y')
# Group by business and month, counting purchases in each month
purchase_trend = df4.groupby(['anonymized_business', 'month-year-formatted']).size().reset_index(name='purchase_count')
# Sort chronologically within each business so shift() compares consecutive months
# (grouping on the "Month Year" strings would order them alphabetically)
purchase_trend = purchase_trend.sort_values(['anonymized_business', 'month-year-formatted'])
# Month-over-month change in purchase frequency per business
purchase_trend['prev_purchase_count'] = purchase_trend.groupby('anonymized_business')['purchase_count'].shift(1)
purchase_trend['change'] = purchase_trend['purchase_count'] - purchase_trend['prev_purchase_count']
# Format month-year back to "Month Year" (e.g., "August 2024") for display
purchase_trend['month-year-formatted'] = purchase_trend['month-year-formatted'].dt.strftime('%B %Y')
# Identify businesses with a declining trend
declining_customers = purchase_trend[purchase_trend['change'] < 0]
declining_customers
| anonymized_business | month-year-formatted | purchase_count | prev_purchase_count | change | |
|---|---|---|---|---|---|
| 3 | business-0000 | November 2024 | 1 | 2.0 | -1.0 |
| 11 | business-0072 | December 2024 | 10 | 23.0 | -13.0 |
| 13 | business-0072 | June 2024 | 8 | 14.0 | -6.0 |
| 14 | business-0072 | May 2024 | 3 | 8.0 | -5.0 |
| 17 | business-0072 | September 2024 | 10 | 22.0 | -12.0 |
| ... | ... | ... | ... | ... | ... |
| 20675 | business-ffd2 | June 2024 | 3 | 8.0 | -5.0 |
| 20676 | business-ffd2 | March 2024 | 2 | 3.0 | -1.0 |
| 20678 | business-ffd2 | September 2024 | 4 | 5.0 | -1.0 |
| 20681 | business-ffff | October 2024 | 25 | 48.0 | -23.0 |
| 20682 | business-ffff | September 2024 | 18 | 25.0 | -7.0 |
7267 rows × 5 columns
Since the set of declining customers is large, let's extract some key insights from it.
# Summary Statistics on Purchase Change
declining_customers['change'].describe()
count    7267.000000
mean      -10.530893
std        13.789538
min      -310.000000
25%       -13.000000
50%        -6.000000
75%        -3.000000
max        -1.000000
Name: change, dtype: float64
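Beyond month-over-month declines, businesses whose most recent purchase is far in the past are outright churn candidates. A minimal sketch with a hypothetical frame mirroring df4's business and month-year columns (the cutoff date is an assumption):

```python
import pandas as pd

# Hypothetical transactions mirroring df4's business and month-year columns
tx = pd.DataFrame({
    'anonymized_business': ['b-1', 'b-1', 'b-2', 'b-3'],
    'month-year': pd.to_datetime(['2024-11-01', '2024-12-01', '2024-03-01', '2024-12-01']),
})

# A business whose most recent purchase predates the cutoff is a churn candidate
cutoff = pd.Timestamp('2024-09-01')
last_seen = tx.groupby('anonymized_business')['month-year'].max()
churn_risk = last_seen[last_seen < cutoff].index.tolist()
print(churn_risk)
```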
# Businesses with the Biggest Declines
top_decliners = declining_customers.nsmallest(25, 'change')
top_decliners
| anonymized_business | month-year-formatted | purchase_count | prev_purchase_count | change | |
|---|---|---|---|---|---|
| 12637 | business-978e | February 2024 | 115 | 425.0 | -310.0 |
| 17010 | business-cf0d | November 2024 | 26 | 313.0 | -287.0 |
| 16745 | business-cb1f | September 2024 | 219 | 451.0 | -232.0 |
| 2941 | business-245e | July 2024 | 25 | 196.0 | -171.0 |
| 10187 | business-78a8 | November 2024 | 6 | 160.0 | -154.0 |
| 9081 | business-6baf | January 2024 | 2 | 155.0 | -153.0 |
| 8047 | business-6011 | September 2024 | 18 | 164.0 | -146.0 |
| 12645 | business-978e | September 2024 | 278 | 415.0 | -137.0 |
| 3006 | business-24c3 | September 2024 | 37 | 171.0 | -134.0 |
| 4763 | business-3955 | July 2024 | 27 | 149.0 | -122.0 |
| 13826 | business-a8bd | March 2024 | 23 | 141.0 | -118.0 |
| 525 | business-07de | December 2024 | 6 | 114.0 | -108.0 |
| 534 | business-07de | September 2024 | 111 | 216.0 | -105.0 |
| 7188 | business-5677 | September 2024 | 106 | 210.0 | -104.0 |
| 3147 | business-263c | December 2024 | 1 | 104.0 | -103.0 |
| 10341 | business-7ab4 | November 2024 | 24 | 125.0 | -101.0 |
| 15483 | business-bc52 | September 2024 | 34 | 129.0 | -95.0 |
| 5891 | business-468e | November 2024 | 59 | 151.0 | -92.0 |
| 10271 | business-7a03 | February 2024 | 18 | 110.0 | -92.0 |
| 18098 | business-de42 | March 2024 | 17 | 107.0 | -90.0 |
| 13421 | business-a3fe | September 2024 | 47 | 136.0 | -89.0 |
| 18907 | business-e997 | December 2024 | 43 | 130.0 | -87.0 |
| 2525 | business-1f55 | November 2024 | 17 | 103.0 | -86.0 |
| 8076 | business-6068 | March 2024 | 47 | 133.0 | -86.0 |
| 2713 | business-2197 | February 2024 | 7 | 91.0 | -84.0 |
# Trend Over Time (Monthly Summary)
# Work on a copy to avoid SettingWithCopyWarning, and parse with an explicit format
declining_customers = declining_customers.copy()
declining_customers['month-year-formatted'] = pd.to_datetime(declining_customers['month-year-formatted'], format='%B %Y')
# Group by month-year and sum changes
monthly_decline = declining_customers.groupby('month-year-formatted')['change'].sum()
# Sort by date (ensures chronological order)
monthly_decline = monthly_decline.sort_index(ascending=True)
# Convert back to string format for proper labeling
monthly_decline.index = monthly_decline.index.strftime('%B %Y')
# Create an interactive bar chart using Plotly
fig = px.bar(
x=monthly_decline.index,
y=monthly_decline.values,
labels={'x': 'Month-Year', 'y': 'Total Decline in Purchases'},
title="Total Monthly Decline in Purchases",
)
# Improve layout
fig.update_layout(
xaxis_title="Month-Year",
yaxis_title="Total Decline",
xaxis=dict(tickangle=-45), # Rotate x-axis labels for better readability
bargap=0.3 # Adjust bar spacing
)
# Show the interactive graph
fig.show()
Re-engagement Strategies
From the top 25 identified businesses with declining purchases, you can implement the following targeted strategies:
📌 Personalized Email & Promotions
- Offer discounts or special deals for returning customers.
- Send a personalized reminder about their past purchases and recommend new products.
📌 Loyalty Programs & Incentives
- Create a points-based loyalty system for frequent buyers.
- Offer exclusive rewards to customers who return after a long time.
📌 Customer Feedback & Surveys
- Send a short survey asking why they’ve reduced their purchases.
- Offer an incentive (e.g., a discount) for completing the survey.
📌 Targeted Ads & Retargeting
- Use Facebook, Google Ads, or LinkedIn to show personalized ads to customers who haven’t purchased recently.
- Implement email reminders with abandoned cart recovery tactics.
Open-Ended Problem¶
If the dataset were 10 times larger, scalability would become a key concern for data storage, processing, and analysis.
Here’s how I would optimize each aspect:
- Data Storage Optimization
Database Choices
- Switch to a Columnar Database: If the data is primarily for analytical purposes, use columnar databases like Amazon Redshift, ClickHouse, or Google BigQuery. These are optimized for aggregations and queries.
Partitioning & Indexing
- Partitioning: Split data by time (e.g., monthly or yearly) or category (e.g., region, product type). This speeds up queries.
- Indexing: Index the columns most often used in filters and joins (e.g., business, date) so queries avoid full scans.
Move to Distributed Storage if Needed
- Cloud Storage Solutions (e.g., AWS S3, Google Cloud Storage, Azure Blob) → Store raw data in a data lake for scalable access.
- Distributed Databases (e.g., Amazon Redshift, Snowflake, Apache Cassandra) → If frequent querying is needed, a distributed SQL-based system can parallelize queries efficiently.
- Data Processing Optimization
Batch vs. Streaming Processing
- Batch Processing (ETL/ELT) → Use Apache Spark, Dask, or Polars instead of Pandas for handling large datasets.
- Streaming Processing (If Real-Time Forecasting Is Needed) → Use Kafka + Spark Streaming or Flink for real-time sales data updates.
Parallel & Distributed Computing
- Leverage Multi-Core Processing: Use Dask, Vaex, or Modin instead of pandas for handling large DataFrames.
- Distributed Computing Frameworks: Apache Spark (PySpark) for large-scale processing; Google BigQuery or Snowflake for cloud-based distributed queries.
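As a minimal illustration of the batch-processing idea, pandas can aggregate incrementally with a chunked reader instead of loading everything into memory; Spark or Dask parallelize the same pattern at scale. A sketch over a tiny hypothetical extract (at 10x scale this would be a large file on disk):

```python
import io
import pandas as pd

# Hypothetical sales extract; at 10x scale this would be a large file on disk
csv = io.StringIO("business,value\nb-1,100\nb-2,50\nb-1,25\nb-3,10\n")

# Aggregate chunk by chunk instead of loading the whole dataset at once
totals = {}
for chunk in pd.read_csv(csv, chunksize=2):
    for biz, val in chunk.groupby('business')['value'].sum().items():
        totals[biz] = totals.get(biz, 0) + val
print(totals)
```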
- Data Analysis Optimization
Efficient Aggregations & Queries
- Precompute Aggregates: Store summary tables (e.g., sales per region/month) to avoid recalculating frequently.
- Materialized Views: Create cached views of common queries to speed up performance.
Hardware & Cloud Scaling
- Scale-Up vs. Scale-Out: Vertical scaling (more RAM, SSDs) for faster processing; horizontal scaling (distributed clusters, parallel processing) for massive datasets.
- Forecasting Model Optimization
Efficient Machine Learning Approaches
- Use gradient boosting (e.g., XGBoost, LightGBM) or deep learning (e.g., LSTMs, Transformers) for better time-series forecasting.
- If real-time updates are needed, use incremental learning models (e.g., River, online learning algorithms).
Final Thoughts
If the dataset grows 10 times larger (or more), transitioning to cloud-based solutions (AWS, GCP, Azure) with serverless data warehouses (BigQuery, Snowflake) would be the best approach.